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^_^ Abstract: In order to maintain consistent quality of service, computer 

^^^ network engineers face the task of monitoring the traffic fluctuations on 

the individual links making up the network. However, due to resource con- 
'^i straints and limited access, it is not possible to directly measure all the links. 

Cn Starting with a physically interpretable probabilistic model of networkwide 

traffic, we demonstrate how an expensively obtained set of measurements 

L» ' rnay be used to develop a network— specific model of the traffic across the 

P^ network. This model may then be used in conjunction with easily obtain- 

■^^ able measurements to provide more accurate prediction than is possible 

• with only the inexpensive measurements. We show that the model, once 

^ learned may be used for the same network for many different periods of 

.^^ traffic. Finally, we show an application of the prediction technique to cre- 

^ ate relevant control charts for detection and isolation of shifts in network 

traffic. 

^~H Keywords and phrases: ordinary kriging, network kriging, routing, Net- 

^ Flow. 

\^ 1. Introduction and Preliminaries 

^, 

ly-^ Computer networks consist of nodes (routers and switches) connected by phys- 

(^ ical links (optical or copper wires). Data from one node (called a source) to an- 

^^ other [destination) is sent over the network on predetermined paths, or routes. 

^"^ We will call the stream of data between a particular source/destination pair a 

^ flow. The so-called flow-level traffic may traverse only a single link, if the source 

'/~j and destination nodes are directly connected, or several links, if they are not. 

r\ Also of interest is the aggregate data traversing each link. The traffic on a given 

C^ link is the sum of the traffic of the various flows using the link. 

Both the flow-level and link-level traffic have been studied in the literature. 
Flow level data is expensive to obtain and process, but provides information di- 
rectly about the flows. Data, especially that involving packet delays from source 
to destination, has been used to do something. See some references. On the other 
hand, the link level data is less expensive to obtain, but provides less informa- 
tion about the underlying flows. These data have been studied extensively by 
the field of network tomography. 
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This work examines the problem of predicting the trafhc level on an unob- 
served link via measurements on a subset of the other links in the network. 
Rather than focusing solely on the inexpensive link level data, we also employ 
flow-level measurements to inform a model that can then be utilized to make 
solve the prediction problem. We demonstrate that this model, although ini- 
tially requiring the expensive flow-level data, may be used with a large range of 
link level data from the same network. Thus, this work is the first to combine 
both flow and link level data in an efficient and feasible way. 

The remainder of the paper is laid out as follows. Section 2 discusses the 
computer network framework and the link-level prediction problem in detail. 
Section 3 describes the proposed model that utilizes the flow-level traffic. Section 
4 demonstrates the robustness of the resulting model, while Section 5 illustrates 
a potential application of the methodologies described herein in the case when 
all links may be observed. 

2. Network Modeling and Prediction 

Here, we introduce some notation and motivate our modeling framework in the 
context of network prediction. 

2.1. Global Traffic Modeling in Computer Networks 

Computer networks consist of collections of nodes (routers and switches) con- 
nected by physical links (optical or copper wires) . The networks may be viewed 
as connected directed or undirected graphs. Figure 1 illustrates, for example, the 
topology of the Internet2 backbone network 12, comprised of 9 nodes (routers) 
and 26 unidirectional links. 

Let n, L and J denote the number of nodes, links, and routes, respectively, 
of a given network. Typically, every node can serve both as a source and a desti- 
nation of traffic and thus there are J^ = n{n — 1) different routes corresponding 
to all ordered {source, destination) pairs. Computer traffic from one node to 
another is routed over predetermined sets of links called paths or routes. These 
paths are best described in terms of the routing matrix A — {a£j)Lxj, where 



Qij 



1 link £ used in route ? ^^n^r^^-^rr 

ATI/; ^ A ■ \ ■ l<i< L, 1< J < J. 

Imk I not used m route j _ _ , _ j _ 



The rows of the matrix A correspond to the L links and the columns to the 
Sf routes. In the Internet2 network, for example, the route j from Chicago to 
Kansas City involves only one link, and thus the j-th column of A has a single 
'1' on corresponding to the link connecting the two nodes; similarly, the fc-hop 
routes correspond to columns of A with precisely k I's. 

We are interested in the statistical modeling of the traffic on the entire net- 
work. Let X{t) = {Xj{t))i<j<j be the vector of the traffic flows at time t 
on all JT routes, i.e. between all source-destination pairs. That is, Xj{t), t = 
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Fig 1: The Internet2 topology, with prediction scenario 7 highhghted (see Table 
5). 



1, 2, • • • , is the number of bytes transmitted over route j during the time interval 

{{t — \)h,th], for some fixed /i > 0. Depending on the context, the time units 

[h) can range from a few milliseconds up to several seconds or even minutes. 

Similarly, let Y{t) = (y«(i))i<^<L be the vector of the traffic loads at time t 

over all L links. Figures 2 and 3 illustrate flow- and link-level traffic over the 

Internet2 network. 

Assuming that traffic propagates instantaneously through the network, the 

load Ye{t) on link £ at time t equals the cumulative traffic of all routes using 

this link: 

J 



Y,it) = Y,aijX,{t)= E^.'-w 



jeAe 



where Ae C {f , 
we obtain 



, i7} is the set of routes that involve link £. In matrix notation. 



Y{t)^AX{t). (2.1) 

We shall refer to (2.1) as to the routing equation. In practice, this relationship 
between the flow-level traffic X{t) and the link-level traffic Y{t) is essentially 
exact, provided that the time scale of measurement h is comparable or greater 
than the maximum round trip time (RTT) for packets in the network. In this 
paper, we consider aggregate data over 10 second intervals, a time substantially 
greater the the RTT of a packet over the Internet2 backbone. 

The statistical behavior of the network traffic traces over individual links or 
flows has been studied extensively in the past 20 years. An important bursti- 
ness phenomenon was observed, which rendered the classical telephone traffic 
models based on Poisson and Markov processes inapplicable. Mechanistic and 
physically interpretable models were developed that explain and relate the ob- 
served burstiness to the notions of long-range dependence and self-similarity. 
For more details, see eg Willinger et al. (1995), Taqqu, Willinger and Sherman 
(1997), Park and Wihinger (2000), Mikosch et al. (2002), Stoev, Michailidis and 
Vaughan (2010), and the references therein. 
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Fig 2: Sampled traffic flow {Xj{t)) time series for high, medium, and low volume 
flows on the Internet2 network, recovered from NetFlow data (see Appendix 6.2, 
for technical details). 



Under general conditions, limit theorems show that the cumulative fluctu- 
ations of the traffic about its mean can be naturally modeled with fractional 
Brownian motions (fBm). The fBm's are zero mean Gaussian processes, which 
have stationary increments and are self-similar. For more details, see Willinger 
et al. (2003) or Section 5 below. Other limit regimes are also possible, which lead 
to stable infinite variance models. These models, however, are less robust to net- 
work configurations than the fractional Brownian motion and are encountered 
less frequently in practice (see eg Mikosch and Samorodnitsky (2007)). 

Here, our focus is not on the temporal dependence of traffic, which has been 
studied extensively. Rather, our goal is to model the global structural relation- 
ship between all links on the network at a fixed instant of time, io- Motivated by 
existing single-flow models, we shall suppose that the vector X{t) representing 
the traffic on all flows of the network is multivariate normal: 



X{t)^N{tix{t),^x{t)). 



(2.2) 



As discussed in Stoev, Michailidis and Vaughan (2010), the link-level, as well 
as flow-level traffic can be well-modeled by using multivariate long-range de- 
pendent Gaussian processes. The routing equation (2.1) yields: 



Y{t) ^ N{Afix{t),AJ:x{t)A'). 



(2.3) 



A first natural application and in fact our motivation to develop global net- 
work traffic models comes from the network prediction problem. The perfor- 
mance of our models will be explored and illustrated in this context. 

Network Prediction (Kriging): Observed are the traffic traces on a subset 
of links O C {1, • • • ,L}: 



V{to,m) :={Yf(f), to - m + I < t < to, i e O}. 



(2.4) 
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over the time window Iq — m -\- 1 < t < Iq of size m. 

Obtain estimators y«(io) for the traffic Y^ito), for all unobserved links £ € 
U :— {1, • • • , L}\0 at time to in terms of the data I?(io, 'm-)- 

In view of the Gaussianity of our global traffic model, we focus on linear 
predictors. In the next section, we start by adapting the classical ordinary kriging 
methodology from Geostatistics to the networking context. This is perhaps the 
best that one can do given no prior statistical information about the network 
other than its topology and routing. Even in such a limited setting it turns out 
that one can sometimes obtain useful predictors. 

In many (0,Z//)-scenarios, however, ordinary network kriging does not provide 
helpful results. The main problem is the lack of information about the traffic 
means (and to a lesser degree covariances) on the unobserved links. This issue 
can only be resolved by incorporating additional, network-specific information 
about the traffic. We do so, in Section 3, where by exploring (off-line) large 
NetFlow data sets, we recover estimates of all flows Xj(t)'s in the network. 
These data are used to build a flexible network-specific model that can be 
estimated on-line from link-level measurements. The resulting model will be 
shown to substantially outperform the ordinary network kriging methodology. 

2.2. General Theory 

If the means and covariances are known, then the network prediction problem in 
the previous section becomes the simple kriging problem from Spatial Statistics. 
The best linear unbiased predictor (BLUP) for Y^ ~ F„(i) in terms of Yq = Yo{t) 
is then given by: 

Yu=^^u + ^uoKoiYo - Mo), (2.5) 



where 



Y = { V ' A^y = ,, h and Ey ^ ' 



Y r ^'^ ~ \ u 



ou 



Here WY(t) ~ fJ-vit) and Ey ~ Ey(i) are the mean and the covariance matrix 
of the vector Y{t), which is partitioned into an observed and unobserved com- 
ponents Yo(t) and Yy^{t), respectively. For simplicity, we omit the argument 'i' 
in (2.5), but it is implicitly present in all quantities therein. The (conditional) 
covariance matrix of prediction errors Y^ — Y^ is given by: 



E 



[\J^u ~ J^^u)(j^u ~ J^j l^o I — Etju — ZjuqZjqq l-iou- (2.b) 



The performance of the simple kriging predictor is illustrated in Figure 3 
(top), and Table 1 (2nd column). In these cases, iiyit) and Ey(i) are estimated 
from moving windows of past observations, where data on all links is available 
(see Appendix 6.3). In the context of our network prediction problem, however, 
not all links are observed, the means and covariances are unknown, and the sim- 
ple kriging methodology is not practical. Nevertheless, it provides a theoretically 
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Fig 3: Top: the standard kriging estimator based on past empirical means and 
covariances. This baseline estimator is not available in practice when only a 
subset of links is observed. Bottom: the ordinary kriging estimator. Both plots 
represent Scenario 7: the prediction of link 13 via links 3, 9, and 12, depicted in 
Figure 1. 



optimal benchmark that we will be used to evaluate all methods developed in 
the sequel. 

The relative mean squared errors (ReMSE) of the predictors reported in Table 
1, and in the sequel, are computed as follows: 



ReMSE(f) = J2 rw - ^^wii VE ii^wii'- 



(2.7) 



Here Y{t) is a predictor for the true value Y{t) and || • |j stands for the Euclidean 
norm. The ReMSE's quantify empirically the prediction error relative to the 
energy of the true signal Y{t), over the duration T. In a controlled setting where 
Y{t) is available, the ReMSE's allow us to objectively compare the performance 
of various estimators. 

A first practical solution to the network prediction problem may be obtained 
by using the ordinary kriging methodology. Namely, suppose that we have no 
prior information about the statistical behavior of the traffic over the network, 
but the routing matrix is available. Then reasonable working assumptions are 
that EY{t) = ii{t)i and Sx(i) = cr|(i)Ij, where /^(i) S K and ax{t) > 
are unknown constant parameters, which vary slowly with time t. That is, the 
traffic means of all links are equal to fJ.{t), and the covariance matrix of the 
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Scenario 


Baseline 


2009-02-19 


2009-02-18 


2009-02-20 


2009-02-26 


2009-03-12 


1 


0.0305 


0.4052 


0.4212 


0.4250 


0.4383 


0.3708 


2 


0.0287 


0.1266 


0.1194 


0.1292 


0.1072 


0.1068 


3 


0.0288 


0.3279 


0.3315 


0.3432 


0.3151 


0.3368 


4 


0.0290 


0.4193 


0.4342 


0.4391 


0.3922 


0.4460 


5 


0.0314 


0.1209 


0.0807 


0.0958 


0.0962 


0.1122 


6 


0.0285 


1.0241 


0.8644 


0.9323 


0.8897 


0.6881 


7 


0.0262 


0.1129 


0.1225 


0.1241 


0.1330 


0.1435 


8 


0.0216 


0.0614 


0.0585 


0.0628 


0.0805 


0.0880 


9 


0.0242 


0.1079 


0.1011 


0.1059 


0.1463 


0.1294 


10 


0.0766 


12.6471 


10.5816 


10.2204 


10.6031 


10.1767 


11 


0.0727 


0.8423 


0.7394 


0.6346 


0.6182 


0.7268 


12 


0.0723 


0.2649 


0.2338 


0.2274 


0.2132 


0.2486 



Table 1 

Columns 2 and 3.' ReMSE's (see (2.7)J of the baseline (simple kriging) and ordinary kriging 

estimators for February 19, 2009, in 12 prediction scenarios (Tables 4 and 5). Columns 4 to 

7: ReMSE's for the ordinary kriging estimators over 4 additional days in each scenario. 



flow-level traffic is scalar. Using these assumptions together with the routing 
equation (2.1), one can obtain the best linear unbiased predictor of Yu{t) (see 
Appendix 6.3 and Cressie (1993) for more details). 

Figure 3 (bottom) and Table 1 demonstrate the performance of this ordinary 
kriging methodology. It is certainly uniformly inferior to the benchmark esti- 
mator in Figure 3 (top) and Table 1 (column 2), which involves data on the 
unobserved links. Nevertheless, ordinary kriging may be useful in cases when no 
structural information about the network is available. 

In the following section, we will develop a model that improves substantially 
upon the ordinary kriging methodology. This can be done, however, only by 
utilizing further information about the network. 

3. Network Specific Modeling via NetFlow Data 



3.1. Modeling Traffic Means 



Direct measurements of the flow-level traffic X{t) are very expensive to obtain 
because this would involve examining the entire traffic load of the network, i.e. 
storing and then processing 95-170 Gigabytes of data per day. Modern routers, 
however, implement a mechanism called "NetFlow", which allow for random 
(Bernoulli) sampling of the flow of traversing packets. The routers store impor- 
tant information such as the ports, source and destination IP addresses, etc. from 
a sample of packet headers. Even though in fast backbone networks (eg Inter- 
net2) the practical sampling rates are eg 1 out of 100 packets, the NetFlow mech- 
anism provides unique information about the traffic loads in the network. Using 
a careful mapping procedure, we assigned the sampled packets to one of the 72 
source/destination flows. We thus constructed an estimate {X{t)} ~ {X{t)} of 
the flow-level traffic. Unfortunately, this method is computationally expensive 
to implement, which makes it impractical to use repeatedly, and it is difficult to 
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Fig 4: Left: The columns correspond to local sample means over consecutive 
windows of 2000 seconds for each of the J = 12 flows. Darker shades indicate 
higher values. The data were reconstructed from NetFlow measurements of the 
Internet2 network for Feb 19, 2009. Right: cumulative energy captured by the 
F matrix for increasing values of p (see (3.1) and Proposition 1 below.) 



apply in an on-line fashion. Therefore, the information derived from NetFlow 
can only be viewed as auxiliary data in the context of network prediction. We 
shall use this infornration to build a flexible network-specific model that can be 
estimated on-line. 

Figure 4a illustrates the local means of the source/destination flows as a 
function of time, where the data were derived from an extensive analysis of 
NetFlow measurements. This suggests that a linear model for fix (t) with a few 
constant factors can capture much of the variability in the local means. We 
therefore posit the model 

fix{t)^Fm, (3.1) 



where F is a suitably chosen J^ x p matrix and /3 
Observe that 



/3{t) e RP is a parameter. 



fly = Ajix = AFf3 and also /xy^ = AoFf3. 

Provided p equals iank{AoF) the parameter /3 can be successfully estimated by 
using linear regression from the available data on the observed links. We will 
see that this essentially means that p is no greater than the number of observed 
links \0\. 

Now, our goal is, given p, to choose F optimally so that F/3 can approximate 
best nx with a suitable f3. Consider the sample X{t), 1 < t < n, oi the flow- 
level data derived from the NetFlow mapping, where n = w x n^ . Partition the 
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data into n,,, windows of size w, and let 



— 1 "" ~ 

X{k) = -^X{{k-l)w + i) 



w 

4=1 

(1 < fc < n), be the sample mean the X{t)'s in the fc-th window. 

Consider first the set of n^ points {X{k), 1 < fc < 71^,} in K^ and observe 
that the model in (3.1) postulates that fix belongs to range(i^) (the linear space 
spanned by the columns of F). Thus, given the X(fc)'s, a least squares optimal 
choice of F corresponds to minimizing the sum of the squared distances from 
the X(fc)'s to the p— dimensional subspace W :— range(F). That is, we want to 
find 

n.uj 

W*^ Argmin V |1X(A:) - PM/(^(fc))f , 

where Pw denotes the orthogonal projection onto the subspace W. The following 
result shows that this problem has a simple solution, which corresponds precisely 
to performing principal component analysis (PCA) on a certain matrix. 

Proposition 1. Let x{k) £ E™, 1 < k < n. Consider the positive semidefinite 
m X m matrix B — '^j^^ix{k)x{ky and let B = ^.j^i ^jbjb*,, be its spectral 
decomposition, where bj, 1 < j < m are orthonormal and Ai > A2 > • • • > 
A™>0. 

Set W* = span{6i,...,6p}, I < p < m. Then, for all W < M™ with 
dim(VF) = p, we have that 

rrt m n 

J2\\xik)-Pw4^ik))r^ E A, <Elk(fc)-^M.(x(fc))|p, (3.2) 

fc=l j=P+l k=l 

where Pw denotes the orthogonal projection onto the subspace W . 

The proof is given in Appendix6.4. This result implies that the least square opti- 
mal choice of the matrix i^ is -F = (61, • • • , bp)jxp, where the bi^s, I < i < p are 
the eigenvectors of the largest p eigenvalues of the matrix B — X]fc=i X{k)X{ky. 
Figure 4b shows that just a few PCA factors p are enough to capture a large 
percentage of the local variability of the mean vectors X{t). Figure 5 (top) il- 
lustrates the prediction performance of this model when the covariance matrix 
is known, but the unobserved means are estimated from the model. 



3.2. Modeling the Covariances 

The physical nature of network protocols, the mechanisms of transmission, and 
the user behavior imply strong relationship between the means and the variances 
of traffic traces. This relationship was shown to be ubiquitous over different 
types of computer networks. In the field of network tomography, for example, 
the mean-variance models have been successfully used to resolve challenging 
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Fig 5: To-p: Prediction in Scenario 7 (Table 5 and Figure 1) using the PCA-mean 
model (with p = 2) and the sample covariance matrix. In reality, the sample 
covariances for unobserved links are not available, and this plot merely illustrates 
that the model (3.1) successfully captures the structure of the means. See also 
Figure 3. Bottom: Prediction in Scenario 7 using the complete mean-variance 
model with p = 2. 



identifiability questions (See eg Lawrence et al. (2006); Singhal and Michailidis 
(2007); Vardi (1996); Xi, Michaihdis and Nair (2006)). In our context, we also 
encountered a strong relationship between the means and the variances of traffic 
flows. More precisely, by exploring the sample means Xj (t) and standard errors 
Sj (t) , calculated over a window of traffic data, we observed that 



s,{t)^c{x,{t)r., i<j<j 



(3.3) 



with 7 « 3/4. Namely, the standard error of a source-destination flow Xj{t) is 
proportional to a power of its mean. 

We estimated 7 (as a function of t) by performing log-linear regression of 
Sj{t) versus Xj{t) over j, 1 < j < J ■ The resulting estimates remained 
approximately constant in t and close to 3/4 regardless of the time window used. 
The power-law relationship is remarkably consistent in time and the regression 
diagnostics R^ ~ 80% indicate strong agreement with the model. For more 
details, see Figure 7a and the discussion in Section 4.3. 

Singhal and Michailidis (2007) have shown that the components of the vector 
oi X{t) are, in practice, uncorrelated or at most weakly correlated. The principal 
exception are pairs oi forward and reverse flows, eg the Chicago-Los Angeles and 
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Los Angeles-Chicago. This correlation is arguably due to the feedback mecha- 
nism built in the TCP protocol. Our experience with NetFlow on Internet2 (eg 
Fig. 2 in Stoev, Michailidis and Vaughan (2010)) and limited NS2-simulations 
(NS2) confirm that this correlation is negligible at our time scales of interest, 
provided that the network is not congested. The study of heavy traffic scenar- 
ios beyond the operating characteristics of the network is interesting but it is 
outside the scope of the present work. Therefore, in this paper we shall model 
Ex(t) as a diagonal matrix. 

In view of this analysis, we shall impose the following structure on the flow 
covariances: 

Sx = a2diag(|i^/3p^), (3.4) 

where |ap'>' denotes {\ai\'^'^)f^-^^ for a — {ai)f^^ e M'^, and where nx = Fl3. 

We will show below that the parameter a can be estimated on-line from 
link-level data (Y^'s). On the other hand, 7 is a structural parameter, obtained 
from the off-line analysis of NetFlow data. (See Section 4.3 for more details.) 

3.3. The Joint Model 

Combining the mean and covariance models from the previous two sections, we 
obtain the following complete model: 

Y{t) = AF/3 + <TAdiag{\F/3\'^)Z{t), (3.5) 

where Z{t) ^ J\f{Q,Ij) is a standard normal vector in M-^ and where 13 G MP 
and tJ > are unknown parameters. In this section, we will show how this model 
can be estimated from on-line measurements on a limited set of observed links 
O. We will also establish asymptotic properties of the proposed estimators. 

In the framework of the Network Prediction Problem (see Section 2.1), we 
obtain 

Yoito)^AoFp + eYSto), 

where 

^ m— 1 

Yo{to)^-y2Yo{to~k). (3.6) 

k=0 

To establish the covariance structure of the noise ey , we introduce the mild 
assumption that the flow-level traffic is stationary (in practice, traffic is locally 
stationary on the time scales of interest) and its temporal correlation structure 
is the same across all routes. Namely, that Cori(Xj(t),Xj{t + i)) = p{i), 1 < 
i < TO - 1, {l<j<J). This yields 

Corr(Y^(i + i), Ye{t)) = p{i), 1 < i < to - 1, for aU 1 < ^ < L, 

and consequently 

eyito) ^ M{0,alAodiag{\F/3\^r^Al), (3.7) 
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where 

<Jl^-(l + 2Y,{l~^/m)p{^)). (3.8) 

m \ ^ — ' / 

The structure of the noise variance suggests a natural iterated generahzed 
least squares (iGLS) scheme for the estimation of /3. 

Algorithm: (Iterated GLS) 

(i) Set /3i = [{AoFfAoF]~^{AoFyYo{to) to be the OLS (ordinary least 
squares) estimate of /? and let k :— 1. 



[{AoFYG{pk)AoFY\AoFYG{h)Yo{to). (3.9) 



(li) Set 

where 

G(/3) := [^odiag(|F/3|2^)A^]-i. (3.10) 



(in) Set k := k + 1 and repeat step CziJ. Iterate until ||/3fc+i — /3fc|| falls below 
a certain "convergence" threshold. 

Observe that the temporal correlation structure does not need to be estimated 
here since it appears only in the scalar coefficient cr^ of the noise variance, which 
cancels in (3.9). The above iGLS scheme requires that the matrices involved in 
steps (i) and (ii) be invertible. The following result ensures that this is indeed 
the case under mild natural conditions. 

Proposition 2. Suppose that F(3 > and let Aq he of full row-rank. Then: 
(i) The inverse G(/3) in (3.10) exists, for all 7 > 0. 
(ii) If AqF is of full column-rank, then the inverses 

^GLsiP) :- [{A,FYG{P)A,F]-\ (3.11) 

and [{AoFy AoF]^^ exist and are positive definite. 

The proof is given in the Appendix and the assumptions are discussed in the 
remarks below. 

Now, if the true parameter j3 is known, then as shown in Chapter 3 of Cressie 
(1993), the minimum variance unbiased predictor (MVUP) of ^^(to) given the 
data T) is the standard kriging estimate 

Y^ := A„F/3 + I]„<,(/3)E-;(/3)(i;(to)-^oi^/3) =: /(/3,i;). 

By (2.6) and (3.5), the covariance matrix of the prediction error K„ — y„ equals: 

m.s.e.(y„) = a^l^uuiP) - S„o(/?)S-<,i(;3)Eo„(/3)) , (3.12) 

where 

5=OT = Adiasdi^ffl-)^.= (|"j« ^-<«). (3.13) 
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In practice, consider the plug-in predictor !"„, where /3 is some estimate of /3. 
Namely, 

Since the function / is continuous, the consistency of j3 would then imply that 
the plug-in predictor is a consistent estimator of the MVUP Y^ ■ The following 
result establishes the strong consistency of the iterated GLS estimators /3fc's, 
even in the presence of long-range dependence. It also shows that the /S^'s are 
asymptotically equivalent to the (unavailable) GLS estimator Pols, provided 
k>2. 

Theorem 1. Suppose that A^ and A^F are oj full row and column ranks, re- 
spectively, and let Fj3 > 0. Then: 

(i) For all k > 1, we have F/3k > 0, a.s. as m — > oo. Hence, the estimates 
Pk, k > 1 are well-defined, almost surely, as m -^ oo. 

(a) If p(t) — > 0, as T -^ oo, then for any fixed k > 1, we have 

Pk ^ P, as m ^ oo. (3-14) 

(Hi) For all k > 2, we have that, 

Pk-^GLS = op{(7yn): asm-)- CO, 

where Pgls is the GLS estimate of j3 in the model (3.5), and cr^j is given in 
(3.8). Moreover, ^^{Pgls) = ^m'^GLsiP) with ^gls as in (3.11). 

The proof is given in Appendix 6.4. The asymptotic variance oi /3k, fc > 2 is 
discussed in the remarks below. 

As mentioned above, the scale parameter a of the covariance structure in 
(3.5) is not involved in the formula for the predictors Y^ = fiPkjYo) (see eg 
(3.9)). The parameter a is involved, however, in the expression of the prediction 
error (3.12). Therefore, to gauge the accuracy of prediction, and to be able to 
use our estimators for detection of anomalies (see Section 5 below), one needs 
an estimate of a. 

As in the case of the ordinary kriging estimator (see (6.1) below), a natural 
estimate of a is obtained as follows: 

d^ :=vcc(SyJ*vec(Eo„(^))/vec(S„o(^))*vec(S„<,(^)), (3.15) 

where I]oo(/3) is as in (3.13), Sy^ ~ '^Yoi'^o) is the sample covariance matrix of 
the vector Yq, calculated from past m observations {Yo(tQ — k), < k < m—l}, 
and /3 is an estimate of /3. 

Proposition 3. Assume the conditions of Theorem l(ii). Then, with f3 = 
/3fc, k > 1, for a^ as in (3.15), we have a^ -^' a^ , as m ^ oo. 

The proof is given in the Appendix. 

In the following section, we will address the estimation, validation and the 
applications of the complete model (3.5) by using real and simulated data. We 
conclude this section with a few technical comments. 
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Remarks: 

1. The model in (3.5) is realistic only if F^ > 0. In our experience, the 
estimates /?fe obtained from real network data always satisfy F/3k > 0. This 
is perhaps due to the careful (optimal) choice of the matrix F discussed 
in Section 3.1. 

2. The assumption that Ao is of full row-rank is natural since for prediction 
purposes, one need not include in the set of observed links ones that are 
perfect linear combination of other observed links. In practice, such a 
redundant scenario can arise only in the trivial case when some nodes do 
not generate traffic. 

3. The full column-rank condition on AoF is required for the identifiability 
of /3. If the dimension of /3 is greater than the number of observed links, 
then the model parameters cannot be identified (see also Section 4.3 and 
Figure 8a). In practice, we implement the iGLS procedure by using the 
Moore-Penroze generalized inverse. 

4. Under the assumptions of Theorem 1, one can show that for all fc > 2, 

n0k - mA - PyiKiA^i)] - ^'„Sgls(/3) + o{al), (3.16) 

as ?7i — > oo, where K C W is an arbitrary compact set containing (3 in 
its interior and such that Ff3 > 0, V/3 e K. Thus, the variance of the 
estimators /3fc, k > 2 obtained in practice is essentially asymptotically 
optimal. 

4. Model Validation and Calibration 

In this section, we evaluate our model in the context of traffic prediction. We 
focus on 12 representative scenarios described in Tables 4, 5 and the Appendix 
6.1, below. 

4.1- Performance 

Tables 1 and 2 provide ReMSE's (2.7) for the optimal baseline estimator, the 
ordinary kriging estimator, and our network-specific model (with p = 2), re- 
spectively. 

Scenarios 1-9 represent situations where the observed links share sufficiently 
many flows with the unobserved ones and thus there is enough information for 
relatively accurate prediction. In all these cases, the network-specific model out- 
performs the more naive ordinary kriging method, with an average improvement 
of the ReMSE by 0.1887 points or 18.87%. The difference is as high as 78% and 
as low as 0.8% in favor of the network speciflc model. In most cases and across 
different days our model yields useful predictions with ReMSE's of about 5%. 
Note that the optimal ReMSE's (Table 1, baseline) are about 2.5% in these 
scenarios. 
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In Scenarios 10-12, however, fewer flows are shared by the observed and un- 
observed Unks and hence the accurate prediction is objectively more difficuh. 
Note that the basehne predictor (Table 1) has over twice the ReMSE's in these 
cases as compared to cases 1-9. In Scenarios 11 and 12, the ordinary kriging es- 
timator outperforms the network specific model with differences in the ReMSE's 
between 19% and 105%. The ReMSE's of the ordinary kriging estimator, how- 
ever, are greater than 21.32%. In Scenario 10, our model is superior to ordinary 
kriging for all five days, but the large ReMSE's indicate that neither approach 
is particularly useful in this case. 

This initial comparison shows that the network-specific model improves sig- 
nificantly upon the naive ordinary kriging approach and comes close to the 
optimal ReMSE lower bound. This is so in the cases where the prediction prob- 
lem is well-posed. Scenarios 10-12 illustrate that the accuracy of prediction has 
natural limitations, inherent to the routing of the network, that none of the two 
models can overcome. 

4-2. Robustness over Time 

One apparent limitation of the network specific approach is that it relies on 
expensive fiow-level data {Xj^s) to build the matrix F. Surprisingly, it turns 
out that once the matrix F is obtained from flow-level measurements during a 
single day, it can be successfully used to model the link-level traffic for many 
days in the future. That is, even though the model requires the extensive off-line 
analysis of NetFlow data, once it is built, it can be readily estimated on-line 
using only link-level data and used for several days before it has to be updated. 
It is remarkable that all results in Table 2 are based on a model (i.e. a matrix F) 
learned from Feb 19, 2009 flow-level data. Then, the same model was used to 
predict Ygs in all 12 scenarios for 5 different days. Even a month later, this model 
continues to outperform the ordinary kriging in the first nine scenarios. Figure 
6 shows also that in only one of the 6 scenarios therein we have an appreciable 
increase in the prediction ReMSE's due perhaps to an outdated model. These 
results may be attributed to the fact that the structure of the traffic means 
across all flows in the network, although complex, is relatively constant, and is 
therefore well-captured by the principal components involved in the matrix F. 
The model must be updated should structural changes in the network occur. 

4-3. Calibration 

Applying the model to real data relies on the choice of several parameters, 
such as p, 7, and ra, as described in Section 3. The prediction performance 
is remarkably robust to the choice of these parameters, as discussed in detail 
below. 

• The role of 7: This parameter controls the mean/variance relationship in the 
model (see (3.1) and (3.4)). We observed the relationship (3.3), between the 
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Scenario 


2009-02-19 


2009-02-18 


2009-02-20 


2009-02-26 


2009-03-12 


1 


0.2476 


0.2342 


0.2363 


0.2629 


0.2209 


2 


0.0517 


0.0461 


0.0550 


0.0424 


0.0746 


3 


0.0514 


0.0459 


0.0549 


0.0425 


0.0750 


4 


0.0521 


0.0465 


0.0552 


0.0427 


0.0740 


5 


0.0512 


0.0696 


0.0658 


0.0694 


0.0596 


6 


0.2414 


0.2651 


0.2864 


0.3344 


0.2722 


7 


0.0468 


0.0619 


0.0587 


0.0684 


0.0462 


8 


0.0388 


0.0501 


0.0495 


0.0564 


0.0384 


9 


0.0395 


0.0510 


0.0504 


0.0567 


0.0398 


10 


3.6668 


3.9110 


3.8143 


5.0877 


5.5161 


11 


1.0322 


1.1060 


1.0687 


1.4335 


1.7803 


12 


0.6277 


0.7618 


0.6875 


0.7792 


0.7449 



Table 2 

ReMSE's of the network-specific model. The matrix F was obtained from Feb 19, 2009 

NetFlow data (Xj 's), and then used to fit the model and perform prediction based on link 

data (Yi 's) for five different days. 



Fig 6: ReMSE of network-specific model over time. The model was learned on 
Feb 19, 2009 (1). The matrix is then used to predict the previous day (2), the 
next day (3), a day one week later (4), and a day 4 weeks later (5). Each line 
corresponds to one of the first six scenarios described in Table 5. 
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ReMSE vs Gamma 
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Fig 7: Left: Estimates of 7 over windows of the data (top) and their B? values 
(bottom), obtained from log-hnear regression of the sample standard devia- 
tions against the means across all source-destination routes. Right: ReMSE of 
scenarios 1-9 for choices of 7. Note that the choice of this parameter does not 
significantly alter the performance of the predictor for the values tested. 



sample means Xj(t) and standard deviations Sj(tys obtained from windows 
of the flow-level data. The parameter 7 was estimated by using a log-linear 
regression of Xj{t) versus Sj{t), over j, I < j < J- This was done for a range 
of window sizes and times i, and the estimates were found to be stable and 
7 « 3/4, as can be clearly seen in Figure 7a. Independently, we explored the 
sensitivity of the model to the choice of 7 and found that the ReMSE's are 
robust to all choices 7 £ [0.5, 2]. See Figure 7b the effect of the choice of 7 for 
several of the prediction scenarios. Small values of 7 « 0.5 lead generally to 
slightly better ReMSE as compared to larger 7's. This may be due to the fact 
that the small powers 7 lead to a 'smoother' covariance matrix and hence have 
a regularizing effect. In practice, however, we need not only accurate prediction 
but also adequate models for the variance, in order to have reliable estimates of 
the prediction error. Therefore, we recommend using 7 « 3/4 as inferred from 
the data. 

• The role of p: The parameter p equals the number of principal components 
(columns of the matrix F) used to model the traffic means in (3.1). The pre- 
diction performance is robust to the choice of p, provided that p is less than 
the number of observed links used in prediction. Figure 8a shows the ReMSE's 
for 3 prediction scenarios as a function of p. In Scenarios 6, 7, and 8 the same 
link is predicted via two, three, and seven other links, respectively (see Table 
5). If p exceeds the number of observed links, then the parameter f3 in (3.5) 
is not identifiable, potentially resulting in poor performance. This explains the 
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ReMSE vs Mean Window 











-e- Senario 6 
-A- Senario? 
+ Senario 8 


/ 


r 




X 


+ A 


/ 




\ 


\ 


' / 










' V " 














(a) Role of p 



(b) Mean Window 



Fig 8: Left: ReMSE of scenarios 6, 7, and 8 as a function of p. Performance 
suffers when p exceeds the number of observed hnks, but is otherwise robust to 
the choice of p. Right: ReMSE of scenarios 1-9, as a function of window size m 
used to estimate the means. We used p = 2 in these cases. 



peaks in the ReMSE's at p = 2, 4, and 7 in Scenarios 6-8. Surprisingly, in 
the first two cases the ReMSE's recover as p grows, even in the presence of 
non-identifiabihty. Similar patterns are seen in the other 9 prediction scenarios 
(omitted, for simplicity). The performance of the model remains stable for all 
choices of p less than the number of predictors. 

In light of these results, we advocate using a relatively small value of p, 
such as 2. While a larger value of p can slightly improve prediction errors when 
many links are observed, having small value of p allows one to fit the model in a 
wide variety of prediction scenarios, without sacrificing the overall performance. 
Recall also Figure 4b. 



• The role of the window size m: In practice, at each time point t, the model 
(3.5) is estimated from a window of past m data {Yo{t — fc), < fc < m — 1}. 
Namely, (3 is obtained by using the iGLS algorithm and a from (3.15) (see 
Section 3.3). Figure 8b illustrates the effect of the window size m on the quality 
of prediction. Note that in all scenarios therein the prediction performance is 
rather robust to the choice of m, provided that m > 10. It is remarkable that 
with the exception of 2 out of the 9 shown prediction scenarios the model works 
well even when m is less than 10. 

Recall that the scalar a does not affect the prediction and the ReMSE's 
in Figure 8b depend only on the quality of estimation of /3. The parameter a 
is involved in the prediction error. Our experiments with simulated data (not 
shown here, for simplicity) show that the estimates of a are also robust to the 
window size m. 
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• Convergence of PiCLS'- In practice, the estimates of Pqls stabilize after a few 
iterations. Here we assume the convergence criterion \\[3k — Pk+i\\ < e, with 
e = 0.001, for example. In the figures and tables, however, the algorithm was 
allowed to run for at least 20 iterations to be conservative. 

4-4- Model Mis specification 

Here, we apply our model (3.5) to simulated data that violates the assumption 
of stationarity. Our goal is to understand the limitations of model, when applied 
to network traffic with slowly changing trend. Network flows are simulated using 
independent fractional Gaussian noise (fGn) time series with self-similarity pa- 
rameter H = 0.8 (see Section 5.1, below). This value of H is typical for several 
real network flows that we examined. 

We compare the mean squared prediction errors in the stationary and non- 
stationary regimes. In the stationary case, constant means are added to the 
simulated fGn's to produce realistic traffic flows with constant positive means. 
Non-stationary traffic traces were obtained by adding a sinusoidal trend to all 
simulated stationary flows. In both cases (stationary and non- stationary), link- 
level data was obtained from the simulated flow-level data through the routing 
equation (2.1). 

We focused on prediction Scenario 8 (see Table 5) . We computed the baseline 
'simple kriging' predictor Y by using the known means and covariances of the 
simulated data. We also estimated our model and used it to obtain a predictor 
Y of the unobserved link. 

Table 3 shows the resulting prediction errors as a function of the window size 
used to estimate the parameter /3. 

The empirical error of our estimators are comparable to the optimal MSE's for 
the baseline estimator. This is so even in the presence of non-stationarity. The 
major exception is when in the non-stationary case the window size becomes 
close to the half-period (50) of the sinusoidal trend. This limited experiment 
shows that our model adapts well and it is essentially robust to non-stationary 
trends provided that relatively small window sizes m are used. 



Window 


Baseline 


5 


10 


Networli Specific Model 
25 30 50 75 


100 


200 


Stationary 
Non-Stationary 


2.61 
2.61 


2.93 
4.15 


2.88 
4.15 


2.83 2.82 2.80 2.78 
4.50 4.71 5.42 6.19 


2.77 
4.99 


2.75 
4.98 



Table 3 
Empirical mean squared errors for the baseline (simple kriging) predictor Y and the 
predictor Y , based on our network-specific model with p = 2. Time series of 20,000 

observations were used. 



5. Statistical Detection of Anomalies 

In this section, we present an application of the above methodology to the case 
when all links on the network are observed. In this case, for each link £, one 



Vaughan, Stoev, & Michailidis/ Network Modeling and Prediction 20 

can compare the observed Yi{t) and the predicted Y^it) traffic. If statisticaUy 
significant deviations are encountered, then this can serve as a flag of an anomaly 
or some structural change in the network traffic. 

To illustrate and detect such differences, we use a modified exponentially 
weighted moving average (EWMA) control chart on the differences Yg~Yg. The 
latter have zero means and variances equal to the prediction error, which can 
be estimated from (3.12) and (3.15). 

Although EWMA control charts are widely used and well-studied (see, e.g. 
Box, Luccno and del Carmen Paniagua- Quinones (2009)), they rely on an as- 
sumption of independent or weakly dependent (in t) observations. Computer 
network traffic is long-range dependent (LRD) and the usual variance formula 
used in the EWMA charts does not apply. We show next how these charts can 
be adjusted to account for the presence such dependence. 

5.1. Control charts for long range dependent data 

Consider the EWMA with discount factor cj) G (0, 1) of the time series {^fe}fc>o: 
Zt := (I - 4>){Zt + <t>Zt-i + 4>''Zt-2 + ■■■) (5.1) 



Letting A = 1 — (/>, this moving average may be efficiently updated via Z, 



t — 
XZt + (1 — \)Zt-i. For independent Z^'s, for Var(Zt) — a"^, we have cr| = 

23^cr|. When the ZtS are long-range dependent, however, the latter formula 
underestimates the variance a^, which can lead to frequent false positive alarms. 
Internet traffic traces are well known to exhibit long-range dependence, which 
can be well-modeled by using fractional Gaussian noise (fCn) (see, eg Willinger 
et al. (2003)). Recall that the zero mean Gaussian time series {Zk] is said to be 
fCnif Zfc = BH{k)—BH{k—\), fc > 1, where {-B// (i)}t>o is a fractional Brownian 
motion with self-similarity parameter H £ (0, 1). The process {-Bff (i)}t>o is 
self-similar and has stationary increments. Its covariance function is 

EBHit)BH{s) = y {\t\'" + |s|2« - |i - sp«) , t, s > 0, 

where cr^ = EBh{1)^ = Var(i3/f (1)). Thus, the fCn {Zk} is a stationary time 
series with auto-covariance 

7H(fc) = EiZkZo) = Y(|fc + ll'"" + \k- M^"" - 2k'") . (5.2) 

For more details, see, eg Taqqu (2003). 

The following result provides an expression for the variance a^ of the EWMA 
control chart corresponding to LRD fGn data. 

Proposition 4. For Zt as in (5.1) with Zt's an fGn with self-similarity pa- 
rameter H G (0, 1) and variance a' , we have: 

,. ,;?,_ A^g^ /- 2(l-cos(g))|0|-^^-i 

^^'^^*^ - C^iHT J-oo A2-2A(l-cos0) + 2(l-cos^)^^' ^^'^^ 
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where C2{Hf := 7r/(iJr(2iJ) sin(iJ7r)). 

The proof is given in the Appendix. In practice, the expression in (5.3) is readily 
evaluated by using numerical integration. 



5.2. Simulated Anomalies in Observed Network Traffic 

We now present some examples of using the adjusted EWMA control chart 
for real network data, applied to the differences Yi{t) — Yiit). The mean is 
taken to be for the duration of the control chart, since the predictor Yi(t) 
is unbiased under the model. The variance of the control chart is calculated 
using (5.3), with a^ replaced by a\ (estimated in the prediction procedure). 
The Hurst LRD parameter H of the traffic is obtained by using the wavelet- 
based methods described in Stoev and Taqqu (2003) . In each of the examples, 
a simple mean-shift anomaly is added to one source-destination flow . Each 
figure in this section shows a plot of the observed, predicted, and true trafhc 
(top panels); the control chart of |Yf — Y^j (middle panels), and an indicator of 
whether the process is identified as out of control (bottom panels) . The vertical 
line indicates the onset of the simulated anomaly. 









iS.-. 




(a) Standard EWMA Control Chart 



(b) LRD-adjustcd EWMA Control Chart 



Fig 9: Performance of the standard EWMA control chart for i.i.d. data (left 
plot) and that of the LRD-adjusted chart (right plot). 



Figure 9 demonstrates the importance of the LRD-adjustment for the control 
limits of the EWMA charts. Here, we examine link 13 (Kansas City to Chicago), 
which is predicted using all links sharing at least one flow with it (Scenario 8 in 
Table 5). A simulated anomaly is added to flow 20, which traverses only link 13. 
The standard chart results in far too many false positives, making it difficult to 
see when the anomaly is added. While the adjusted chart still has several false 
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positives (due to high traffic variabihty), the onset of the anomaly is essentially 
detected. 

The second example shows how one can use the chart to determine which 
flow is behaving anomalously. The mean shift was added to flow 6 (Kansas 
City to Atlanta), which traverses two links: 13 (Kansas City to Chicago) and 17 
(Chicago- Atlanta) (see Figure 1). The LRD-adjusted control charts for Links 13 
(Figure 10a) and 17 (not shown), clearly indicate the onset of the anomaly. The 
charts of the other links, not carrying the anomalous flow, (eg link 7 in Figure 
10b) involve just a few false alarms and detect no anomaly. This suggests that 
the flow using links 13 and 17 (that is, flow 6) is experiencing the anomaly. 




Link Carrying Anomalous Flow (13) (b) Link Not Carrying Anomalous Flow (7) 



Fig 10: Detecting an anomalous multi-link flow. The simulated anomaly is added 
to a two-link flow. The links carrying the flow show anomalous behavior (left 
plot), while the rest behave as in the figure on right. 



In the last example, we illustrate a case where the anomaly detection is 
inherently more challenging. We add a mean shift to the relatively long flow 14 
(Seattle to Atlanta), which traverses four links: 3 (Seattle to Salt Lake City), 9 
(Salt Lake City to Kansas City), 13 (Kansas City to Chicago), and 17 (Chicago 
to Atlanta) (see Figure 1). In Figure 11a, the control chart is based on predicting 
Link 17 by using all links that do not carry the anomalous flow. Unfortunately, 
these links do not provide sufficient information to predict Link 17, hence the 
predictor is a relatively 'smooth curve' as compared to the true traffic trace, 
and the error is relatively large. Nevertheless, we can pick up the anomaly. The 
segment with false positive alerts can be explained by the presence of "bias" 
in our model. That is, the model Ff3 is not capturing the fine dynamics of the 
means. Indeed, if we repeat the exercise using simulated traffic (Figure lib), it 
shows that again the predictor is not particularly useful i.e. it yields a smooth 
curve that tracks only the local means. In this situation, however, there is no 
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Obs w/Noi 
Pred 




(a) Real Traffic 



(b) Simulated Traffic 



Fig 11: An anomaly is added to flow 14 (Seattle to Atlanta), and a control chart 
is constructed on the Chicago- Atlanta Link (17), where all 'non-anomalous' 
links are used in the prediction. These links, however, do not provide enough 
information and the predictor is a relatively smooth curve. 



bias and the control chart accurately identifies the onset of the anomaly. 



6. Appendix 

6.1. Internet2 Network Description and Prediction Scenarios 

This appendix provides details concerning the lnternet2 network, as well as the 
12 prediction scenarios that were investigated earlier. 

The Internet2 network consists of 26 unidirectional links. In order to simplify 
notation, each link was assigned an id number. Table 4 provides the mapping 
from the link id numbers to the source and destination of each link, as well as the 
link capacities at the time of data collection. At the time of the data collection, 
all links had a 10 Gb/s capacity, with the exception of 4 links: Chicago to Kansas 
City, Kansas City to Chicago, New York to Washington, and Washington to New 
York. These four links actually were comprised of two 10 Gb/s capacity cables, 
for a total capacity of 20 Gb/s. Similar upgrades have been made to other links 
since the data presented in this work was collected. 

In the preceding analysis, estimators are compared for 12 different prediction 
scenarios. These scenarios are summarized in Table 5. In the scenarios, a total 
of three links are treated as unobserved, and a subset of the remaining links are 
used as predictors. The choice of these links was not entirely arbitrary. Recall 
that the traffic on any single link is equal to the sum of the trafhc of the flows 
that utilize the link. The three unobserved links were chosen to represent a 
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Link ID 


Source — s> Destination 


Capacity 


1,2 


Los Angeles — > Seattle 


10 Gb/s 


3,4 


Seattle -s> Salt Lake City 


10 Gb/s 


5,6 


Los Angeles — 5> Salt Lake City 


10 Gb/s 


7,8 


Los Angeles — !> Houston 


10 Gb/s 


9, 10 


Salt Lake City — > Kansas City 


10 Gb/s 


11, 12 


Kansas City — > Houston 


10 Gb/s 


13, 14 


Kansas City -^ Chicago 


20 Gb/s 


15, 16 


Houston -^ Atlanta 


10 Gb/s 


17, 18 


Chicago — > Atlanta 


10 Gb/s 


19, 20 


Chicago — >• New York 


10 Gb/s 


21, 22 


Chicago — !> Washington 


10 Gb/s 


23, 23 


Atlanta — > Washington 


10 Gb/s 


25, 26 


Washington — > New York 


20 Gb/s 



Table 4 

id's o/ t/ie 26 links of the Internet2 backbone. Odd Link ID's correspond to the forward and 

the even to the reverse; i.e. Link 15 is the Houston to Atlanta link and Link 16 is the 

Atlanta to Houston link. 



Case 


Predicted 


Observed Links 


1 


7 


2,12 


2 


7 


2,12,13,15 


3 


7 


2,12,13,15,23,25 


4 


7 


2,3,9,12,15,21,23,25 


5 


13 


3,7 


6 


13 


3,9 


7 


13 


3,9,12 


8 


13 


3,7,9,12,17,19,21 


9 


13 


2,3,9,12,15,21,23,25 


10 


19 


3,9 


11 


19 


3,9,13 


12 


19 


2,3,9,12,15,21,23,25 



Description of 12 Scenarios (cases) use 



Table 5 

to evaluate the model. 



The choice of predictors 



based on the number of shared traffic flows. The link id's are given in Table . 



range in the utilization level of the links (in terms of number of flows). Link 
7, predicted in scenarios 1-4, is used by a medium number of flows. Link 13, 
predicted in scenarios 5-9, is used by 14 flows, the most of any link. Finally, link 
19 is utilized by a small number of flows. 

The links used as predictors were also chosen based on the number of source/destination 
flows shared with the unobserved link. Namely, for each unobserved link, the 
predictors were selected so that they share at least one source/destination flow 
with the unobserved link. The exceptions are scenarios 4, 9, and 12, which in- 
volve the same set of predictors. In these three scenarios, our goal is to better 
understand the effect of utilization on the performance of the model. 

6.2. Constructing Sampled Flow-Level Data 



In this section, we briefly describe how the X traffic series were created from 
NetFlow data. Rather than provide a complete technical description, we seek 
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to provide intuition for how the data were constructed. NetFlow records consist 
of individual records, each that group a sequence of packets sharing several 
common characteristics including: source and destination IP address, protocol, 
source and destination port, and type of service. These records also include as 
the number of packets, number of bytes, the start, and end time of the group. 
The input and output interfaces were of particular importance in the mapping 
procedure, since they allow us to identify, for each record, the physical address 
from where that stream of packets arrived at the router, and the physical address 
of the next step on the network. In particular, it allows one to determine whether 
or not the series of packets arrived from the Internet2 backbone and whether it 
was sent out on the backbone, or to a different network. In the first pass over 
the data, we create a careful mapping from each (source IP, destination IP) pair 
to a (Source Router, Destination Router) pair. Then we pass through the data 
a second time, using our mapping to assign each observed flow to Source Router 
and Destination Router, and thereby assign each observed packet to one of the 
72 source/destination flows present on the backbone. 

6.3. On the implementation of simple and ordinary kriging 

The baseline estimator in Section 2.2 is essentially the simple kriging predictor, 
which assumes knowledge of the mean and covariance of Y {fiyit) and Sy (i)). 
In practice, we estimate these quantities from moving windows of past data: 
/xy(i) « nvit) :- i Er=i ^(^ - j) and 

Sy(t) « Ey{t) := — - Y.^Y{t - j) - nY{t)){y{t - j) - nY{t)Y- 

m — 1 ^-^ 

The ordinary kriging methodology is used in our first solution of the predic- 
tion problem in Section 2.2. In this case, Ey is modeled by (j\AA^ , where the 
scale fJx is unknown. The means EY^ — iiy sue unknown but assumed to be 
constant across the links 1 < i < L. Here ax and /iy are allowed to vary slowly 
with time t. In contract, to the baseline estimator, we can no longer use p,Y(t) 
and T.Y{t) above since only some links are observed. Under these assumptions, 
the least squares optimal linear predictor of a link Yu{t) from Yo{t) becomes 

Yu{t) ^ AYo{t), where A = (" p°° M 7o„, 

where Too = (E(l£; — Y^ .)^)£;^£.go is a matrix of the variograms for the set of 
observed links O and %u = (E(F„ — Yi)'^)i^o is a vector of cross-variograms 
between the unobserved link and observed links. For more details, see Cressie 
(1993). The resulting ordinary kriging coefiicients are such that 1*A = 1 so 
that the predictor is unbiased. In our application, we calculated the variograms 
by estimating the unknown parameter ax from a window of past data from 
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the observed links Yo{t). Namely, since Sy^ — a\AoA\^, we obtain the Unear 
regression estimate 

a| = [vec(A„^*)*vec(A„A^)]-ivec(Ei.J*vec(A„A^), (6.1) 

where vec(-B) stands for the vectorized matrix B. The estimator u\ corresponds 
to minimizing ||vec(Ey^) — (7'^^ec{AoAl^)\\ with respect to a^ . 

6.4- Proofs 

Proof of Proposition 1. Let W = span{/i,--- ,/p}, where {/i,--- ,/m} is an 
orthonormal basis of M™. Observe that, for aU 1 < fc < n: 

m m 

\\x{k)-Pw{x{k))r^ ^ {x{k),f,r= J2 f'Mk)x{kY)fr 
'j=p+l j=p+l 

Now, by summing over w, we have that the right-hand side of (3.2) equals: 

rri n rn 

j=p+l k=l j=P+l 

Clearly, the last sum is minimized when span{/p_|_i, • • • , /„} — (W*)-^ = spa.n{bp^i, • • • , &„}• 
In this case, this sum equals X]flp+i -^i- ^ 

Proof of Proposition 2. Since Ff3 > 0, the matrix diag(|i^/3p^) is positive defi- 
nite. Thus, the fact that Ao is of full row-rank, implies that the square matrix 
^odiag(|F/3p''')A^ is of full row-rank and hence invertible. This proves (i). 

To show (ii), let X eRP and suppose that x\AoFYG{(3)AoFx = 0. Thus, 
for the vector y = AoFx, we have y*G{l3)y = 0. This, since G'(/3) is positive 
definite, implies that y = AqFx = 0, which in tmn yields a; = 0, because AoF 
has a trivial null-space. We have thus shown that [AoFY G{fi)AoF is a positive 
definite matrix. D 

Proof of Theorem 1. Note that 

Fo(to) = AoFp + aA„diag(|F/3r)Z(io), 

where Z(io) — :^ J2t=t -m+i -^(O- Since p{t) -^ 0, t — > oo, the Maruyama's 
Theorem implies that the Gaussian process Z — {Z{t)}ti=z is mixing. Therefore, 
Z{to) °4' 0, and hence Yo{to) '^-^' AoFI3, as m — > oo. 

Note that the OLS estimator /3i is well-defined since AqF is of full column 
rank. Observe also that, since Yo{to) — ?►' AoF/3, we have 

^1 - /3 = [iAoFYAoF]-\AoFY{Yoito) - A,F(3) ^0, as m ^ oo. 
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We proceed by induction. Let fc > 2 and /3fc_i ^' /3, ttt, — > oo. Then, since 
FP > 0, by continuity, F/3k^i > 0, almost surely, as rn ^- oo, and /3k is well- 
defined, as w — ?► oo (Proposition 2). Further, for all fc > 2, by (3.9), 

A = C{A-i)Yoito), (6.2) 

with the matrix 

C0) := [{AoFYG0)A,F]-\A,FYG0). (6.3) 

Note that C(/3) is well-defined and continuous for F(5 > 0. Since, also C{P)AoFP — 
P, the convergences l^o(io) -^ AoFf], and Pk-i -^ /?, imply that /3fc °^' /?, as 
m — > oo. We have thus shown parts (i) and ('iij. 

We shall now prove (Hi). As in (6.2), for all fc > 2, we have 

Pk^C0k-i)Yo{to) and also 3^gls = C(/3)F„(io). 

Note also that C{/3)AoF/3 = ,3 = C(/3fc)AoF/3, and therefore, 

A - Pgls = {C0k-i) - C(/?))(Fo(to) - AF/3). (6.4) 

Now, by (3.7) and (3.10), we have 

Var(Fo(to)) = crlGipyK (6.5) 

Thus, Relation (6.4), the convergence /3fc_i ^' l3, m — )■ oo, and the continuity 
of C(-) imply that f3k = Pgls = op{am): rn -^- cx). Note also that by (6.3) and 
(6.5), we readily have Va.r0GLs) = crl^^cLsiP)- □ 

Proof of Proposition 3. As in the proof of Theorem 1, the Maruyama's theorem 
implies that Sy^ — >' Sy^, as to — >■ oo. By Theorem 1 (^iij we also have /3fc — >' /3, 
as ?n — ^ cx). Note that the right-hand side of (3.15) is a continuous function of 
j3 and Ey^, which by (3.5), equals ct^ when /3 and Ey^ are replaced by /3 and 
Ey^ , respectively. This implies the strong consistency oia^. D 

Proof of Proposition 4- The variance of Zt is then given by: 

CXD oo OO OO ^TT 

E[Zf] = (l-0)^5:^0^+S(j-fc-) = il-c^fJ2Y.'^'^" / e^'^'-'^f{9)d0, 

j=0 fe=0 j=0 fc=0 '^ 

where /(0) stands for the spectral density of {Zk}- By using the expression of 
/ given in equation 9.12 on p. 34 of Taqqu (2003) , we obtain that IE[^f'^] equals 

2 . . /-"^ on -rnsCfl^^lfl|-(2^+l) 



i-Jfr"o ^^ ^ ' J-oo </>2 + l-2</>cos0 



OO 



n 
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